3 research outputs found

    On the use of word embedding for cross language plagiarism detection

    Full text link
    [EN] Cross language plagiarism is the unacknowledged reuse of text across language pairs. It occurs if a passage of text is translated from source language to target language and no proper citation is provided. Although various methods have been developed for detection of cross language plagiarism, less attention has been paid to measure and compare their performance, especially when tackling with different types of paraphrasing through translation. In this paper, we investigate various approaches to cross language plagiarism detection. Moreover, we present a novel approach to cross language plagiarism detection using word embedding methods and explore its performance against other state-of-the-art plagiarism detection algorithms. In order to evaluate the methods, we have constructed an English-Persian bilingual plagiarism detection corpus (referred to as HAMTA-CL) comprised of seven types of obfuscation. The results show that the word embedding approach outperforms the other approaches with respect to recall when encountering heavily paraphrased passages. On the other hand, translation based approach performs well when the precision is the main consideration of the cross language plagiarism detection system.Asghari, H.; Fatemi, O.; Mohtaj, S.; Faili, H.; Rosso, P. (2019). On the use of word embedding for cross language plagiarism detection. Intelligent Data Analysis. 23(3):661-680. https://doi.org/10.3233/IDA-183985S661680233H. Asghari, K. Khoshnava, O. Fatemi and H. Faili, Developing bilingual plagiarism detection corpus using sentence aligned parallel corpus: Notebook for {PAN} at {CLEF} 2015, In L. Cappellato, N. Ferro, G.J.F. Jones and E. SanJuan, editors, Working Notes of {CLEF} 2015 – Conference and Labs of the Evaluation forum, Toulouse, France, September 8–11, 2015, volume 1391 of {CEUR} Workshop Proceedings, CEUR-WS.org, 2015.A. Barrón-Cede no, M. Potthast, P. Rosso and B. Stein, Corpus and evaluation measures for automatic plagiarism detection, In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner and D. Tapias, editors, Proceedings of the International Conference on Language Resources and Evaluation, {LREC} 2010, 17–23 May 2010, Valletta, Malta. European Language Resources Association, 2010.A. Barrón-Cede no, P. Rosso, D. Pinto and A. Juan, On cross-lingual plagiarism analysis using a statistical model, In B. Stein, E. Stamatatos and M. Koppel, editors, Proceedings of the ECAI’08 Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse, Patras, Greece, July 22, 2008, volume 377 of {CEUR} Workshop Proceedings. CEUR-WS.org, 2008.Farghaly, A., & Shaalan, K. (2009). Arabic Natural Language Processing. ACM Transactions on Asian Language Information Processing, 8(4), 1-22. doi:10.1145/1644879.1644881J. Ferrero, F. Agnès, L. Besacier and D. Schwab, A multilingual, multi-style and multi-granularity dataset for cross-language textual similarity detection, In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk and S. Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation {LREC} 2016, Portorož, Slovenia, May 23–28, 2016, European Language Resources Association {(ELRA)}, 2016.Franco-Salvador, M., Gupta, P., Rosso, P., & Banchs, R. E. (2016). Cross-language plagiarism detection over continuous-space- and knowledge graph-based representations of language. Knowledge-Based Systems, 111, 87-99. doi:10.1016/j.knosys.2016.08.004Franco-Salvador, M., Rosso, P., & Montes-y-Gómez, M. (2016). A systematic study of knowledge graph analysis for cross-language plagiarism detection. Information Processing & Management, 52(4), 550-570. doi:10.1016/j.ipm.2015.12.004C.K. Kent and N. Salim, Web based cross language plagiarism detection, CoRR, abs/0912.3, 2009.McNamee, P., & Mayfield, J. (2004). Character N-Gram Tokenization for European Language Text Retrieval. Information Retrieval, 7(1/2), 73-97. doi:10.1023/b:inrt.0000009441.78971.beT. Mikolov, K. Chen, G. Corrado and J. Dean, Efficient estimation of word representations in vector space, CoRR, abs/1301.3, 2013.S. Mohtaj, B. Roshanfekr, A. Zafarian and H. Asghari, Parsivar: A language processing toolkit for persian, In N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis and T. Tokunaga, editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7–12, 2018, European Language Resources Association ELRA, 2018.R.M.A. Nawab, M. Stevenson and P.D. Clough, University of Sheffield – Lab Report for {PAN} at {CLEF} 2010, In M. Braschler, D. Harman and E. Pianta, editors, {CLEF} 2010 LABs and Workshops, Notebook Papers, 22–23 September 2010, Padua, Italy, volume 1176 of {CEUR} Workshop Proceedings, CEUR-WS.org, 2010.G. Oberreuter, G. L’Huillier, S.A. Rios and J.D. Velásquez, Approaches for intrinsic and external plagiarism detection – Notebook for {PAN} at {CLEF} 2011, In V. Petras, P. Forner and P.D. Clough, editors, {CLEF} 2011 Labs and Workshop, Notebook Papers, 19–22 September 2011, Amsterdam, The Netherlands, volume 1177 of {CEUR} Workshop Proceedings, CEUR-WS.org, 2011.Pinto, D., Civera, J., Barrón-Cedeño, A., Juan, A., & Rosso, P. (2009). A statistical approach to crosslingual natural language tasks. Journal of Algorithms, 64(1), 51-60. doi:10.1016/j.jalgor.2009.02.005M. Potthast, A. Barrón-Cede no, A. Eiselt, B. Stein and P. Rosso, Overview of the 2nd international competition on plagiarism detection, In M. Braschler, D. Harman and E. Pianta, editors, {CLEF} 2010 LABs and Workshops, Notebook Papers, 22–23 September 2010, Padua, Italy, volume 1176 of {CEUR} Workshop Proceedings, CEUR-WS.org, 2010.Potthast, M., Barrón-Cedeño, A., Stein, B., & Rosso, P. (2010). Cross-language plagiarism detection. Language Resources and Evaluation, 45(1), 45-62. doi:10.1007/s10579-009-9114-zM. Potthast, A. Eiselt, A. Barrón-Cede no, B. Stein and P. Rosso, Overview of the 3rd international competition on plagiarism detection, In V. Petras, P. Forner and P.D. Clough, editors, {CLEF} 2011 Labs and Workshop, Notebook Papers, 19–22 September 2011, Amsterdam, The Netherlands, volume 1177 of {CEUR} Workshop Proceedings. CEUR-WS.org, 2011.M. Potthast, S. Goering, P. Rosso and B. Stein, Towards data submissions for shared tasks: First experiences for the task of text alignment, In L. Cappellato, N. Ferro, G.J.F. Jones and E. SanJuan, editors, Working Notes of {CLEF} 2015 – Conference and Labs of the Evaluation forum, Toulouse, France, September 8–11, 2015, volume 1391 of {CEUR} Workshop Proceedings, CEUR-WS.org, 2015.Potthast, M., Stein, B., & Anderka, M. (s. f.). A Wikipedia-Based Multilingual Retrieval Model. Advances in Information Retrieval, 522-530. doi:10.1007/978-3-540-78646-7_51B. Pouliquen, R. Steinberger and C. Ignat, Automatic identification of document translations in large multilingual document collections, CoRR, abs/cs/060, 2006.B. Stein, E. Stamatatos and M. Koppel, Proceedings of the ECAI’08 Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse, Patras, Greece, July 22, 2008, volume 377 of {CEUR} Workshop Proceedings, CEUR-WS.org, 2008.J. Wieting, M. Bansal, K. Gimpel and K. Livescu, Towards universal paraphrastic sentence embeddings, CoRR, abs/1511.0, 2015.V. Zarrabi, J. Rafiei, K. Khoshnava, H. Asghari and S. Mohtaj, Evaluation of text reuse corpora for text alignment task of plagiarism detection, In L. Cappellato, N. Ferro, G.J.F. Jones and E. SanJuan, editors, Working Notes of {CLEF} 2015 – Conference and Labs of the Evaluation forum, Toulouse, France, September 8–11, 2015, volume 1391 of {CEUR} Workshop Proceedings, CEUR-WS.org, 2015.Barrón-Cedeño, A., Gupta, P., & Rosso, P. (2013). Methods for cross-language plagiarism detection. Knowledge-Based Systems, 50, 211-217. doi:10.1016/j.knosys.2013.06.01
    corecore